Abstract

This note presents some representative methods that use dictionary learning (DL) for
classification. We do not review the more sophisticated methods or frameworks that involve DL
for classification, such as online DL and spatial pyramid matching (SPM); rather, we concentrate
on direct DL-based classification methods. Here, a "direct DL-based method" is an approach that
works directly with the DL framework by adding meaningful penalty terms. The representative
methods can be roughly divided into two categories: (1) directly making the dictionary
discriminative, and (2) forcing the sparse coefficients to be discriminative so as to push the
discrimination power of the dictionary. From this taxonomy, we anticipate some extensions as
future research.



1 Introduction 

Dictionary learning (DL), as a particular sparse signal model, aims to learn a set of atoms, also
called visual words in the computer vision community, such that a few atoms can be linearly
combined to approximate a given signal well. From the viewpoint of compressed sensing, it was
originally designed to learn an adaptive codebook that faithfully represents signals under a
sparsity constraint. In recent years, researchers have applied the DL framework to other
applications and achieved state-of-the-art
performances, such as image denoising [3] and inpainting 4 , clustering [21 IS]) classification JTJ [B] , 
etc. 

It is well known that the conventional DL framework is not adapted to classification, because
the learned dictionary is used merely for signal reconstruction. To circumvent this problem,
researchers have developed several approaches that learn a classification-oriented dictionary in
a supervised fashion by exploiting the label information. In this note, we review some existing
representative DL-based classification methods. By comparison, we can roughly divide them into
two categories: (1) directly forcing the dictionary to be discriminative, or (2) making the
sparse coefficients discriminative (usually by simultaneously learning a classifier) to promote
the discrimination power of the dictionary. The first category, named Track I in this note,
mainly uses the representation error for the final classification, whereas the second category
(Track II) can use the sparse coefficients as a new feature representation for classification.

Table 1: Two categories of DL-based classification methods.

Category    Representative Approaches
Track I     Meta-face learning [12], DLSI [8]
Track II    SupervisedDL [6], D-KSVD [13], LC-KSVD [5], FisherDL [11]



Track I includes Meta-face learning [12] and DL with structured incoherence [8], and Track II
contains supervised DL [6], discriminative K-SVD [13], label consistent K-SVD [5] and Fisher
discrimination DL [11]. The abbreviations of these methods are listed in Table 1.

The organization of this note is as follows. In the remainder of this section, we review an
important method called sparse representation-based classification (SRC) [10], and then introduce
the general dictionary learning framework together with the notation used in this note. Note
that even though SRC does not learn a dictionary, it opened the way for classification based on
sparse coding. In Section 2 we introduce Meta-face learning [12] and DLSI [8] as two specific
examples of Track I, which uses the reconstruction error for the final classification, as SRC
does. Its counterpart, Track II, is presented in Section 3, including SupervisedDL [6],
D-KSVD [13], LC-KSVD [5] and FisherDL [11]. In Section 4 we give a brief summary of DL-based
classification methods and anticipate some extensions for future work.

1.1 Sparse Representation-Based Classification 

Wright et al. [10] propose the sparse representation-based classification (SRC) method for robust
face recognition and achieve very impressive results. Suppose there are $C$ classes of individual
faces, and let $D = [X_1, \ldots, X_c, \ldots, X_C] \in \mathbb{R}^{d \times N}$ be the set of
original training samples, where $X_c \in \mathbb{R}^{d \times N_c}$ is the subset of the $N_c$
vector-represented training samples from class $c$. SRC treats the original data set as an
overall dictionary. Denote by $x \in \mathbb{R}^{d}$ a query facial image; SRC then identifies
$x$ by the following two-stage procedure:

1. Sparsely code $x$ over $D$ via $\ell_1$-norm minimization:

$$\hat{a} = \arg\min_{a} \|x - Da\|_2^2 + \lambda \|a\|_1, \tag{1}$$

where $\lambda$ is a scalar constant.

2. Assign $x$ to the class

$$\hat{c} = \arg\min_{i} \|x - X_i \delta_i(\hat{a})\|_2^2, \tag{2}$$

where $\delta_i(\cdot)$ is a vector indicator function that extracts the elements corresponding
to the $i$-th class.

SRC achieves very impressive performance in face recognition and is robust to nuisances such as
occlusion and lighting variation. Even though SRC learns no dictionary for classification, it
acts as a vanguard that opened the way for classification with the help of sparse coding. From
this viewpoint, SRC naively uses all the training samples as one dictionary, wherein the
class-specific training sets are sub-dictionaries contributing to discrimination.
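
The two-stage procedure of Eqs. (1) and (2) can be sketched in a few lines of NumPy. This is
only a minimal illustration, not the implementation of [10]: the iterative soft-thresholding
solver `ista_lasso`, the function names, the value of `lam` and the toy data are all our own
assumptions.

```python
import numpy as np

def ista_lasso(D, x, lam, n_iter=500):
    """Minimize ||x - D a||_2^2 + lam * ||a||_1 by iterative soft-thresholding (assumed solver)."""
    L = 2.0 * np.linalg.norm(D, 2) ** 2  # Lipschitz constant of the gradient of the quadratic term
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        z = a - 2.0 * D.T @ (D @ a - x) / L                    # gradient step
        a = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft-thresholding step
    return a

def src_classify(D, labels, x, lam=0.01):
    """Stage 1: sparse-code x over D. Stage 2: pick the class with the smallest residual."""
    a = ista_lasso(D, x, lam)
    classes = np.unique(labels)
    residuals = [np.sum((x - D[:, labels == c] @ a[labels == c]) ** 2) for c in classes]
    return classes[int(np.argmin(residuals))]

# Hypothetical toy data: two classes concentrated around two different directions in R^5.
rng = np.random.default_rng(0)
base = np.eye(5)[:, :2]
D = np.hstack([base[:, [0]] + 0.05 * rng.standard_normal((5, 4)),
               base[:, [1]] + 0.05 * rng.standard_normal((5, 4))])
D /= np.linalg.norm(D, axis=0)          # unit-norm training samples
labels = np.array([0] * 4 + [1] * 4)    # class label of each column of D
query = base[:, 1] + 0.05 * rng.standard_normal(5)
predicted = src_classify(D, labels, query)
```

Masking the coefficient vector with `labels == c` plays the role of $\delta_i(\cdot)$: only the
coefficients of class $c$ take part in the class-wise reconstruction.
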

1.2 Dictionary Learning Framework

Learning an adaptive dictionary (possibly overcomplete) aims to provide a pool of bases in which
a few bases can be linearly combined to approximate a novel signal. Suppose there is a set of
signals, denoted by $X = [x_1, \ldots, x_i, \ldots, x_N]$, where $x_i$ is the $i$-th signal. Then
the conventional dictionary learning framework learns the dictionary as follows:

$$
\begin{aligned}
\{A, D\} &= \arg\min_{D \in \mathbb{R}^{d \times K},\, A \in \mathbb{R}^{K \times N}}
            \sum_{i=1}^{N} \|x_i - D a_i\|_2^2 + \lambda \|a_i\|_1 \\
         &= \arg\min_{D \in \mathbb{R}^{d \times K},\, A \in \mathbb{R}^{K \times N}}
            \|X - DA\|_F^2 + \lambda \|A\|_1 \\
         &\quad \text{s.t. } \|d_k\|_2^2 \le 1, \quad \forall k = 1, \ldots, K,
\end{aligned}
\tag{3}
$$

where $A = [a_1, \ldots, a_N]$ is the coefficient matrix, $d_k$ is the $k$-th atom (column) of
$D$, and $\|A\|_1 = \sum_{i=1}^{N} \|a_i\|_1$.
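
A common way to attack Eq. (3) is to alternate between the coefficients and the dictionary. The
sketch below is one such scheme under our own assumptions: sparse coding by iterative
soft-thresholding, a least-squares dictionary update followed by a heuristic rescaling that
enforces the norm constraint, and arbitrary values for `lam`, `n_outer` and the toy data. It is
not K-SVD and not the algorithm of any method reviewed in this note.

```python
import numpy as np

def sparse_code(D, X, lam, n_iter=100):
    """ISTA on all columns at once: min_A ||X - D A||_F^2 + lam * ||A||_1."""
    L = 2.0 * np.linalg.norm(D, 2) ** 2 + 1e-12
    A = np.zeros((D.shape[1], X.shape[1]))
    for _ in range(n_iter):
        Z = A - 2.0 * D.T @ (D @ A - X) / L
        A = np.sign(Z) * np.maximum(np.abs(Z) - lam / L, 0.0)
    return A

def update_dictionary(X, A, D):
    """Least-squares fit of D for fixed A, then rescale atoms so that ||d_k||_2 <= 1."""
    G = A @ A.T + 1e-8 * np.eye(A.shape[0])        # small ridge keeps the system solvable
    D_new = np.linalg.solve(G, A @ X.T).T
    unused = np.linalg.norm(A, axis=1) == 0        # atoms that no signal uses keep their old value
    D_new[:, unused] = D[:, unused]
    return D_new / np.maximum(np.linalg.norm(D_new, axis=0), 1.0)

def learn_dictionary(X, K, lam=0.1, n_outer=10, seed=0):
    """Alternate sparse coding and dictionary update for a fixed number of rounds."""
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((X.shape[0], K))
    D /= np.linalg.norm(D, axis=0)
    for _ in range(n_outer):
        A = sparse_code(D, X, lam)
        D = update_dictionary(X, A, D)
    return D, sparse_code(D, X, lam)

# Hypothetical toy signals: 40 random vectors in R^8, coded over K = 6 atoms.
X_toy = np.random.default_rng(1).standard_normal((8, 40))
D_learned, A_learned = learn_dictionary(X_toy, K=6)
```

The rescaling step is a simple heuristic; it keeps the atoms feasible but is not the exact
constrained minimizer over $D$.
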

It is widely known that the classic dictionary learning framework is designed for reconstruction
rather than classification, even though good classification results have been reported in the
literature. It is believed that classification performance can be further improved by carefully
learning a classification-oriented dictionary. In the next section, we look at several DL-based
classification methods belonging to Track I.

2 Track I: Directly Making the Dictionary Discriminative 

The methods of Track I use the reconstruction error for the final classification, so the learned
dictionary ought to be as discriminative as possible. Inspired by SRC, Yang et al. propose
meta-face learning [12] to learn an adaptive dictionary for each class, and Ramirez et al. [8]
add an incoherence term to derive more refined classification-oriented dictionaries. We now
present the two methods.

2.1 Meta-Face Learning 

SRC directly adopts the original facial images as the dictionary. However, as discussed in [12],
this pre-defined dictionary incorporates much redundancy as well as noise and trivial
information that can be detrimental to face recognition. Additionally, as the training data
grow, the computation of sparse coding becomes a major bottleneck. To address this problem, Yang
et al. [12] propose the Metaface learning method, which learns a class-specific dictionary for
each object:

$$
D_i = \arg\min_{D_i, A_i} \|X_i - D_i A_i\|_F^2 + \lambda \|A_i\|_1,
\quad \text{s.t. } \|d_j^i\|_2 \le 1, \; \forall j = 1, \ldots, K,
\tag{4}
$$

where the matrix $X_i \in \mathbb{R}^{d \times N_i}$ contains all the training images from the
$i$-th class as its columns, $d_j^i$ is the $j$-th column of the $i$-th class-specific
sub-dictionary $D_i = [d_1^i, \ldots, d_K^i] \in \mathbb{R}^{d \times K}$, and $\|A_i\|_1$ is
defined as the sum of the $\ell_1$-norms of all the columns of
$A_i = [a_1^i, \ldots, a_{N_i}^i] \in \mathbb{R}^{K \times N_i}$, i.e.
$\|A_i\|_1 = \sum_{j=1}^{N_i} \|a_j^i\|_1$. The Metaface learning method concatenates all the
sub-dictionaries into an overall dictionary $D = [D_1, \ldots, D_C]$ for classification, which
then proceeds as in the second stage of SRC.

2.2 Dictionary Learning with Structured Incoherence

Ramirez et al. note that the learned sub-dictionaries may share some common bases, i.e. some
visual words from different sub-dictionaries can be very coherent [8]. Coherent atoms can be
used interchangeably to reconstruct a query image, so a classifier based on the reconstruction
error will fail to identify some queries. To circumvent this problem, they add an incoherence
term that drives the dictionaries associated with different classes to be as independent as
possible.

The incoherence term is denoted as $Q(D_i, D_j) = \|D_i^T D_j\|_F^2$. It is easy to see that this
term drives the atoms from different sub-dictionaries to be as independent/incoherent as
possible. Ramirez et al. thus derive the final dictionary learning method with structured
incoherence:

$$
\min_{\{D_i, A_i\}_{i=1}^{C}} \sum_{i=1}^{C} \left\{ \|X_i - D_i A_i\|_F^2 + \lambda \|A_i\|_1 \right\}
+ \eta \sum_{i \ne j} \|D_i^T D_j\|_F^2,
\tag{5}
$$

where $A_i = [a_i^1, \ldots, a_i^{n_i}] \in \mathbb{R}^{k_i \times n_i}$, and each column $a_i^j$
is the sparse code corresponding to signal $j \in [1, \ldots, n_i]$ in class $i$.

They observe empirically that even though the incoherence term is imposed on the dictionaries,
atoms representing features common to all classes tend to appear repeated almost exactly in the
dictionaries of different classes [8]. Being so common, these atoms are used often and their
associated reconstruction coefficients have a high absolute value $|a_r|$,
$r \in \{1, \ldots, k_i\}$, which makes the reconstruction costs similar across classes. They
further propose to detect such atoms by inspecting the already available $D_i^T D_j$ matrices,
whose entries, in absolute value, are the inner products between atoms. By ignoring the
coefficients associated with these common atoms when computing the reconstruction error, they
improve the discriminative power of the system.
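
The incoherence term and the detection of shared atoms both reduce to operations on the matrix
$D_i^T D_j$, as the following sketch shows. The function names, the threshold of 0.95 and the
toy sub-dictionaries are hypothetical choices of ours; [8] may select common atoms by a
different rule.

```python
import numpy as np

def incoherence(D_i, D_j):
    """Q(D_i, D_j) = ||D_i^T D_j||_F^2."""
    return np.linalg.norm(D_i.T @ D_j, "fro") ** 2

def shared_atoms(D_i, D_j, threshold=0.95):
    """Pairs (r, s) such that |<atom r of D_i, atom s of D_j>| exceeds the assumed threshold."""
    G = np.abs(D_i.T @ D_j)
    return [(int(r), int(s)) for r, s in np.argwhere(G > threshold)]

# Toy sub-dictionaries with unit-norm atoms: atom 0 of D1 equals atom 1 of D2.
D1 = np.array([[1.0, 0.0],
               [0.0, 1.0],
               [0.0, 0.0]])
D2 = np.array([[0.0, 1.0],
               [0.0, 0.0],
               [1.0, 0.0]])
q = incoherence(D1, D2)
pairs = shared_atoms(D1, D2)
```

In this toy case the only non-zero inner product comes from the duplicated atom, so it is the
single pair reported and the only contribution to $Q(D_1, D_2)$.
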



3 Track II: Making the Coefficients Discriminative 

Track II differs from Track I in how discrimination is achieved. In contrast to Track I, it
forces the sparse coefficients to be discriminative and thereby indirectly propagates the
discrimination power to the overall dictionary. Track II only needs to learn an overall
dictionary rather than class-specific dictionaries. In this section, we list several recently
proposed methods belonging to Track II.



3.1 Supervised Dictionary Learning 

Before presenting this method, we clarify that Supervised DL (SupervisedDL) refers to the
specific approach proposed in [6], not to supervised DL frameworks in general.

Mairal et al. propose to combine logistic regression with the conventional dictionary learning
framework as follows:

$$
\begin{aligned}
(A, D) = \arg\min_{D \in \mathbb{R}^{d \times K},\, A \in \mathbb{R}^{K \times N}}
& \sum_{i=1}^{N} \left( \mathcal{C}\big(y_i f(x_i, a_i, \theta)\big)
  + \lambda \|x_i - D a_i\|_2^2 + \lambda_1 \|a_i\|_1 \right) + \lambda_2 \|\theta\|_2^2, \\
& \text{s.t. } \|d_k\|_2^2 \le 1, \quad \forall k = 1, \ldots, K,
\end{aligned}
\tag{6}
$$

where $\mathcal{C}$ is the logistic loss function ($\mathcal{C}(x) = \log(1 + e^{-x})$), which
enjoys properties similar to those of the hinge loss from the SVM literature while being
differentiable, and $\lambda_2$ is a regularization parameter that prevents overfitting. This is
the approach chosen in [TJ. The function $f$ is a classification function, either linear in $a$:
$f(x, a, \theta) = \theta^T a + b$ with $\theta \in \mathbb{R}^{K}$, or bilinear in $a$ and $x$:
$f(x, a, \theta) = x^T W a + b$ with $\theta = \{W \in \mathbb{R}^{d \times K}, b \in \mathbb{R}\}$.
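
To make the ingredients of Eq. (6) concrete, the sketch below evaluates the logistic loss, the
two choices of $f$, and the per-sample summand for the linear model. It only evaluates the cost
and does not perform the optimization of [6]; the function names, the binary label convention
$y \in \{-1, +1\}$ and the toy values are assumptions made for illustration.

```python
import numpy as np

def logistic_loss(z):
    """C(z) = log(1 + exp(-z)), evaluated in a numerically stable way."""
    return np.logaddexp(0.0, -z)

def f_linear(a, theta, b):
    """Linear model in the sparse code: theta^T a + b."""
    return theta @ a + b

def f_bilinear(x, a, W, b):
    """Bilinear model in the signal and its code: x^T W a + b."""
    return x @ W @ a + b

def sample_cost(x, y, a, D, theta, b, lam, lam1):
    """Summand of Eq. (6) for one sample with the linear model; y is assumed to be -1 or +1."""
    return (logistic_loss(y * f_linear(a, theta, b))
            + lam * np.sum((x - D @ a) ** 2)
            + lam1 * np.sum(np.abs(a)))

# Toy values: x is reconstructed exactly by a, and the classifier scores it as +2.
D_toy = np.eye(2)
x_toy = np.array([1.0, 0.0])
a_toy = np.array([1.0, 0.0])
theta_toy = np.array([2.0, 0.0])
cost_correct = sample_cost(x_toy, +1, a_toy, D_toy, theta_toy, 0.0, lam=1.0, lam1=0.1)
cost_wrong = sample_cost(x_toy, -1, a_toy, D_toy, theta_toy, 0.0, lam=1.0, lam1=0.1)
```

With the same code and dictionary, flipping the label only changes the logistic term, which is
why `cost_wrong` exceeds `cost_correct`.
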

3.2 Discriminative K-SVD for Dictionary Learning 

Zhang and Li propose discriminative K-SVD (D-KSVD) to obtain a dictionary that has good
representation power while supporting optimal discrimination of the classes [13]. D-KSVD adds a
simple linear regression term as a penalty to the conventional DL framework:

$$
(D, W, A) = \arg\min_{D, W, A} \|X - DA\|_F^2 + \lambda_1 \|H - WA\|_F^2
+ \lambda_2 \|A\|_1 + \lambda_3 \|W\|_F^2,
\tag{7}
$$

where $H = [h_1, \ldots, h_N] \in \mathbb{R}^{C \times N}$ is the label matrix of the training
images, in which $h_n = [0, \ldots, 0, 1, 0, \ldots, 0]^T$ and the position of the non-zero
element indicates the class. $W$ is the parameter of the classifier, and $\lambda_1$,
$\lambda_2$ and $\lambda_3$ are scalars controlling the relative contribution of the
corresponding terms.

Note that the first two terms can be fused into one, and the term $\|W\|_F^2$ can be dropped
during computation owing to the protocol of the original K-SVD algorithm (details in [13]).
After the classifier parameter $W$ and the dictionary are obtained, the final classification of
a query image is very fast.
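
The fusion of the first two terms of Eq. (7) follows from stacking:
$\|X - DA\|_F^2 + \lambda_1 \|H - WA\|_F^2$ equals the squared Frobenius norm of
$[X; \sqrt{\lambda_1} H] - [D; \sqrt{\lambda_1} W] A$. The sketch below checks this identity
numerically on random data. The helper names and the toy sizes are hypothetical, and the
`classify` rule (largest entry of $Wa$) is our assumption of how the linear classifier would be
applied to a sparse code; see [13] for the actual protocol.

```python
import numpy as np

def stack_for_dksvd(X, H, D, W, lam1):
    """Build the stacked data and stacked dictionary that fuse the first two terms of Eq. (7)."""
    s = np.sqrt(lam1)
    return np.vstack([X, s * H]), np.vstack([D, s * W])

def classify(W, a):
    """Assumed decision rule: index of the largest entry of W a."""
    return int(np.argmax(W @ a))

# Random toy problem: d = 6, K = 4 atoms, C = 3 classes, N = 5 samples.
rng = np.random.default_rng(0)
d, K, C, N = 6, 4, 3, 5
X = rng.standard_normal((d, N))
D = rng.standard_normal((d, K))
W = rng.standard_normal((C, K))
A = rng.standard_normal((K, N))
H = np.eye(C)[:, rng.integers(0, C, size=N)]   # one-hot label columns
lam1 = 0.5

separate = np.sum((X - D @ A) ** 2) + lam1 * np.sum((H - W @ A) ** 2)
X_ext, D_ext = stack_for_dksvd(X, H, D, W, lam1)
fused = np.sum((X_ext - D_ext @ A) ** 2)
```

Once the problem is written in the stacked form, a standard dictionary learning solver can be
run on `X_ext` with the enlarged dictionary `D_ext`.
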

3.3 Label Consistent K-SVD 

Jiang et al. propose the label consistent K-SVD (LC-KSVD) method to learn a discriminative
dictionary for sparse coding [5]. They introduce a label consistency constraint called the
"discriminative sparse-code error" and combine it with the reconstruction error and the
classification error to form a unified objective function:

$$
\begin{aligned}
(D, W, G, A) = \arg\min_{D, W, G, A}
& \|X - DA\|_F^2 + \lambda_1 \|Q - GA\|_F^2 + \lambda_2 \|H - WA\|_F^2 + \lambda_3 \|A\|_1 \\
& \text{s.t. } \|d_k\|_2^2 \le 1, \quad \forall k = 1, \ldots, K,
\end{aligned}
\tag{8}
$$

where $H$ and $W$ are the same as in D-KSVD described in the previous subsection,
$G \in \mathbb{R}^{K \times K}$ is a linear transformation applied to the sparse codes, and
$Q = [q_1, \ldots, q_N] \in \mathbb{R}^{K \times N}$ is the label consistency term. Here
$q_n = [0, \ldots, 1, \ldots, 1, 0, \ldots, 0]^T \in \mathbb{R}^{K}$ is an indicator
corresponding to the input signal $x_n$: the non-zero values of $q_n$ occur at those indices
where the input signal $x_n$ and the dictionary codeword share the same label.

The term $\|Q - GA\|_F^2$ represents the discriminative sparse-code error, which enforces that
the transformed sparse codes $GA$ approximate the discriminative sparse codes $Q$. It forces
signals from the same class to have very similar sparse representations, i.e. it encourages
label consistency in the resulting sparse codes. At the same time, the linear regression term
$\|H - WA\|_F^2$ is added, which is the same as that of D-KSVD [13]. As in D-KSVD, the final
classification is very fast owing to the classifier parameter matrix $W$.
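
The construction of $Q$ is purely combinatorial once every dictionary atom has been assigned a
class label, as the following sketch shows. The function name and the toy label assignments are
hypothetical; how atoms receive their labels in the first place is left to [5].

```python
import numpy as np

def build_Q(atom_labels, sample_labels):
    """Q[k, n] = 1 when atom k and sample n share the same class label, and 0 otherwise."""
    atom_labels = np.asarray(atom_labels)
    sample_labels = np.asarray(sample_labels)
    return (atom_labels[:, None] == sample_labels[None, :]).astype(float)

# Toy setting: K = 6 atoms (two per class) and N = 4 training samples.
atom_labels = [0, 0, 1, 1, 2, 2]
sample_labels = [0, 1, 1, 2]
Q = build_Q(atom_labels, sample_labels)
```

Each column of `Q` is the target code $q_n$ of one training sample: ones on the atoms of its own
class and zeros elsewhere.
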

3.4 Fisher Discriminant Dictionary Learning 

Yang et al. propose the Fisher discrimination dictionary learning (FisherDL) method, based on
the Fisher criterion, to learn a structured dictionary [11] whose atoms correspond to class
labels. The structured dictionary is denoted as $D = [D_1, \ldots, D_C]$, where $D_c$ is the
class-specific sub-dictionary associated with the $c$-th class. Denote the data set by
$X = [X_1, \ldots, X_C]$, where $X_c$ is the subset of the training samples from the $c$-th
class. They then solve the following formulation over the dictionary and the coefficients to
derive the desired discriminative dictionary:

$$
\begin{aligned}
(D, A) = \arg\min_{D \in \mathbb{R}^{d \times K},\, A \in \mathbb{R}^{K \times N}}
& \mathcal{C}(X, D, A) + \lambda_1 \|A\|_1 + \lambda_2 f(A), \\
& \text{s.t. } \|d_k\|_2^2 \le 1, \quad \forall k = 1, \ldots, K,
\end{aligned}
\tag{9}
$$

where $\mathcal{C}(X, D, A)$ is the discriminative fidelity term, $\|A\|_1$ is the sparsity
constraint, and $f(A)$ is a discrimination constraint imposed on the coefficient matrix $A$.
Both $\mathcal{C}$ and $f$ are discussed below.

The discriminative fidelity term. We can write $A_i$, the representation of $X_i$ over $D$, as
$A_i = [A_i^1; \ldots; A_i^c; \ldots; A_i^C]$, where $A_i^c$ is the coding coefficient of $X_i$
over the sub-dictionary $D_c$. Denote the representation of $X_i$ by $D_c$ as
$R_c = D_c A_i^c$. First of all, the dictionary $D$ should be able to represent $X_i$ well, so
that $X_i \approx D A_i = D_1 A_i^1 + \cdots + D_i A_i^i + \cdots + D_C A_i^C
= R_1 + \cdots + R_i + \cdots + R_C$. Second, since $D_i$ is associated with the $i$-th class,
it is expected that $X_i$ should be well represented by $D_i$ but not by $D_j$, $j \ne i$. This
implies that $A_i^i$ should have some significant coefficients such that
$\|X_i - D_i A_i^i\|_F^2$ is small, while $A_i^j$ should have nearly zero coefficients such that
$\|D_j A_i^j\|_F^2$ is small. Thus the discriminative fidelity term is defined as:

$$
\mathcal{C}(X_i, D, A_i) = \|X_i - D A_i\|_F^2 + \|X_i - D_i A_i^i\|_F^2
+ \sum_{j=1,\, j \ne i}^{C} \|D_j A_i^j\|_F^2.
\tag{10}
$$
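
The three parts of Eq. (10) can be evaluated block by block, as in the sketch below. The
function name, the list-of-blocks layout and the toy example with one atom per class are
assumptions made for illustration.

```python
import numpy as np

def fidelity(X_i, D_blocks, A_blocks, i):
    """Eq. (10) for class i, where D_blocks[c] is D_c and A_blocks[c] is A_i^c."""
    D = np.hstack(D_blocks)
    A_i = np.vstack(A_blocks)
    total = np.sum((X_i - D @ A_i) ** 2)                             # whole dictionary fits X_i
    total += np.sum((X_i - D_blocks[i] @ A_blocks[i]) ** 2)          # own sub-dictionary fits X_i
    for j, (D_j, A_ij) in enumerate(zip(D_blocks, A_blocks)):
        if j != i:
            total += np.sum((D_j @ A_ij) ** 2)                       # other sub-dictionaries stay silent
    return float(total)

# Toy example: two classes with one atom each, and a single class-0 sample equal to atom 0.
D_blocks = [np.array([[1.0], [0.0]]), np.array([[0.0], [1.0]])]
X_0 = np.array([[1.0], [0.0]])
good = fidelity(X_0, D_blocks, [np.array([[1.0]]), np.array([[0.0]])], i=0)
bad = fidelity(X_0, D_blocks, [np.array([[0.0]]), np.array([[1.0]])], i=0)
```

Coding the sample with its own atom gives a zero cost, whereas coding it with the atom of the
other class is penalized by all three terms.
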

The discriminative coefficient term. To make the dictionary $D$ discriminative for the samples
in $X$, we can make the coding coefficients of $X$ over $D$, i.e. $A$, discriminative. Based on
the Fisher criterion, this can be achieved by minimizing the within-class scatter of $A$,
denoted by $S_W$, and maximizing the between-class scatter of $A$, denoted by $S_B$. $S_W$ and
$S_B$ are defined as:

$$
S_W = \sum_{c=1}^{C} \sum_{x_i \in X_c} (a_i - m_c)(a_i - m_c)^T,
\qquad
S_B = \sum_{c=1}^{C} (m_c - m)(m_c - m)^T,
$$

where $a_i$ is the coefficient vector of $x_i$, $m_c$ is the mean of the coefficient vectors of
class $c$, and $m$ is the mean of all coefficient vectors in $A$.

Intuitively, we can define $f(A)$ as $\mathrm{tr}(S_W) - \mathrm{tr}(S_B)$. However, such an
$f(A)$ is non-convex and unstable. To solve this problem, Yang et al. propose to add an elastic
term $\|A\|_F^2$ to $f(A)$:

$$
f(A) = \mathrm{tr}(S_W) - \mathrm{tr}(S_B) + \eta \|A\|_F^2.
\tag{11}
$$
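
Because the trace of an outer product $vv^T$ is $\|v\|_2^2$, Eq. (11) can be evaluated without
forming the scatter matrices. The sketch below does so with the between-class scatter taken
unweighted, as written in this note; weighting each class by its number of samples is a common
variant of the Fisher criterion. The function name and the toy coefficients are our own
assumptions.

```python
import numpy as np

def fisher_term(A, labels, eta):
    """f(A) = tr(S_W) - tr(S_B) + eta * ||A||_F^2, with an unweighted S_B."""
    labels = np.asarray(labels)
    m = A.mean(axis=1, keepdims=True)              # mean of all coefficient vectors
    tr_sw, tr_sb = 0.0, 0.0
    for c in np.unique(labels):
        A_c = A[:, labels == c]
        m_c = A_c.mean(axis=1, keepdims=True)      # mean of the class-c coefficient vectors
        tr_sw += np.sum((A_c - m_c) ** 2)          # trace of the within-class scatter
        tr_sb += np.sum((m_c - m) ** 2)            # trace of the between-class scatter
    return float(tr_sw - tr_sb + eta * np.sum(A ** 2))

# Toy coefficients: one-dimensional codes of four samples, two per class.
A_toy = np.array([[0.0, 2.0, 10.0, 12.0]])
labels_toy = [0, 0, 1, 1]
f_plain = fisher_term(A_toy, labels_toy, eta=0.0)
f_elastic = fisher_term(A_toy, labels_toy, eta=1.0)
```

In the toy case the classes are tight and far apart, so the plain criterion is negative; the
elastic term then adds $\eta \|A\|_F^2$ on top of it.
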



Incorporating all the terms, we have the following FDDL model:

$$
(D, A) = \arg\min_{D, A} \left\{ \sum_{i=1}^{C} \mathcal{C}(X_i, D, A_i)
+ \lambda_2 \left( \mathrm{tr}(S_W) - \mathrm{tr}(S_B) + \eta \|A\|_F^2 \right)
+ \lambda_1 \|A\|_1 \right\}.
\tag{12}
$$

Several crucial issues are related to this model, such as the convexity of $f(A)$ and the sparse
coding step, and the authors discuss these issues in depth in [11]. For classification, they
still use the reconstruction error, as in Track I.

4 Summary

In the previous two sections, we reviewed some representative DL-based classification
approaches from both Track I and Track II. Adding a discrimination term to the conventional DL
framework is an intuitive yet effective way to derive a well-learned dictionary for
classification.

Examining these methods, we can anticipate a general framework:

$$
\begin{aligned}
\min_{D, W, A} \;& \mathcal{C}(Y, X, D, A) + \eta f(W, A, Y)
+ \lambda_A h_A(A) + \lambda_W h_W(W) \\
& \text{s.t. constraint on } D,
\end{aligned}
\tag{13}
$$

where $\mathcal{C}(Y, X, D, A)$ is the conventional DL framework, $f(W, A, Y)$ is the
discrimination term on the sparse coefficients, $h_A$ and $h_W$ are the Lagrangian penalty terms
on the sparse coefficient matrix $A$ and the projector $W$, and $\eta$ and the $\lambda$'s are
scalars that balance their weights. Note that $W$ does not necessarily denote a single
projector; it may represent several. From Eq. (13) we can see that, by employing the label
matrix $Y$, the discriminative dictionary can be learned directly through the term
$\mathcal{C}(Y, X, D, A)$; at the same time, the term $f(W, A, Y)$ can propagate the
discrimination power of the coefficients to the dictionary, making the dictionary even more
discriminative and reliable for classification. If we set $\eta = 0$, Eq. (13) reduces to
Track I; if we omit the label information in the term $\mathcal{C}(Y, X, D, A)$, Eq. (13)
reduces to Track II. Note that FisherDL [11] can also be cast as a specific instance of
Eq. (13), which drives the dictionary to be as discriminative as possible from two directions
(a direct push and an indirect push through the coefficients).

Besides, the main concern seems to be the trade-off between classification accuracy and the
complexity of the formulation. Furthermore, on large-scale databases, these methods are
time-consuming when learning the dictionary. Therefore, extending these methods to online
versions is an interesting and significant research direction.